Mon. Not. R. Astron. Soc. 000, ITHHl fOOOOl Printed 3 March 2013 (MN WFe& style file v2.2) 



An Efficient Parameter Space Search as an Alternative to 
Markov Chain Monte Carlo 



£^ ; Scott F. Daniel 1 , Andrew J. Connolly 1 , and Jeff Schneider 2 



1 Department of Astronomy, University of Washington, Seattle, WA 

2 Machine Learning Department, Carnegie Mellon University, Pittsburgh, 



3 March 2013 



PA 



ABSTRACT 

We consider the problem of inferring constraints on a high-dimensional parameter 
space with a computationally expensive likelihood function. Markov chain Monte Carlo 
(MCMC) methods offer significant improvements in efficiency over grid-based searches 
and are easy to implement in a wide range of cases. However, MCMCs offer few guar- 
antees that all of the interesting regions of parameter space are explored. We propose a 
machine learning algorithm that improves upon the performance of MCMC by intelli- 
gently targeting likelihood evaluations so as to quickly and accurately characterize the 
likelihood surface in both low- and high-likelihood regions. We compare our algorithm 
to MCMC on toy examples and the 7-ycar WMAP cosmic microwave background data 
release. Our algorithm finds comparable parameter constraints to MCMC in fewer calls 
to the likelihood function and with greater certainty that all of the interesting regions 
of parameter space have been explored. 



1 INTRODUCTION 



Researchers in many fields use Markov chain Monte 
Carlo (MCMC) methods to translate data into inferences 
on high-dimensional parameter spaces ()Gilks et al. 1996). 
Studies of the cosmic microwave background (CMB) 
anisotropy spectra are a perfect example of this prac- 
tice (|Lewis and Bridle 20021 |Dunkley et al. 2005| . The con- 
cordance cosmology requires six or seven (depending on 
whether or not one assumes spatial flatness) parameters to 
theoretically describe the spectrum of CMB anisotropics. 
Given the computational expense of converting just one 
point in this parameter space into an anisotropy spec- 
trum (1.3 seconds using the Boltzmann code CAMB 
( |Lewis, Challinor, and Lasenby 2000[ ) on a 2.5 GHz dual- 
core machine) and then comparing that spectrum to the 
data (2 seconds using the official WMAP likelihood code 
(WMAP 2010)), exhaustively exploring the parameter space 
in search of models that fit the data well is unfeasible. 
MCMC is an alternative to this costly process that ran- 
domly walks through parameter space such that exploration 
theoretically locates and confines itself to regions where the 
fit to the data is good, and thus integrates the distribution of 
parameter values only over that space where it has high sup- 
port. Integrations that might have taken months when done 
exhaustively can be accomplished in days; those that might 
have taken days can be accomplished in hours. It is fair to 



say that MCMC has become an "industry standard" within 
some communities. In spite of this success, serious questions 
remain. MCMC is a sampling method and its theoretical 
guarantees are mostly with respect to convergence rather 
than sampling efficiency ( Atchadc and Rosenthal 2005 ) . Pa- 
rameter fitting problems are search problems, not sampling 
problems. There is no fundamental reason that a sampling 
algorithm should be good for searching and certainly no 
guarantee that the performance of MCMC as a search al- 
gorithm with a finite number of samples will be particularly 
efficient. Finally, MCMC methods generally force investiga- 
tors to make Bayesian assumptions which may or may not 
be desirable. 

We present an algorithm originally proposed and im- 
plemented by ( |Bryan et al. 2007b) as a more efficient and 
flexible alternative to MCMC. We will refer to the algorithm 
as the Active Parameter Search (APS ) algorithm. We pro- 
vide C++ code to implemement APS (publically available 
at https://github.com/uwssg/APS) and test it on the 7- 
year release of the WMAP CMB data (|Jarosik et al. 2011[) . 
Section[5]presents an overview of both the MCMC and APS 
algorithms and discusses the shortcomings of the former and 
how the latter attempts to address them. Section [3] details 
the user-specified parameters necessary to APS and some of 
the attendant considerations. Section [4] compares the per- 
formance of MCMC and APS on a toy model of a multi- 
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modal likelihood function. Section [5] presents the results of 
the WMAP 7 test. We find that APS achieves parameter 
constraints at comparable (if not faster) times to MCMC 
while simultaneously exploring the parameter space more 
broadly. 



2 MCMC 

We begin this section with a general discussion of the pro- 
cess of deriving parameter constraints from data. We then 
sketch an MCMC method called the Metropolis-Hastings al- 
gorithm (|Gilks et al. 19961 ITewls and Bridle 20020 . We will 
outline the APS algorithm with an aside discussing Gaus- 
sian processes, a method of multi-dimensional function fit- 
ting which occupies some prominence in the algorithm. We 
end this section by directly comparing APS with MCMC on 
a toy model in a two-dimensional parameter space. 

The basic problem both MCMC and APS attempt to 
address is the following: suppose there is a theory described 
by some number of parameters 8 = {8i, 82, 63, . . . }. This 
theory makes a prediction about some function / which 
is tested by an experiment resulting in Nd data points 
{di, d,2, ■ ■ ■ , d,N D }. To quantify what these data say about 
the theory, an investigator wants to ask what combinations 
of values of 8 result in predictions that are consistent with 
the data to some threshold probability (100 — a)%. Usually, 
this is quantified by assuming that the Nd data points are 
independent and normally distributed so that the statistic 

(where Oi is the uncertainty associated with the di data- 
point) is a random variable distributed according to the \ 2 
probability distribution V x i with Nd degrees of freedom. In 
that case, the (100 — a)% confidence limit constraint on 8 
corresponds to the set of all values 8 which result in values 
of x 2 such that 

rx 2 

/ d X 2 'P x 2(x 2 ') < (100 -q)% 
Jo 

More generally, one has a likelihood function £ (above repre- 
sented by V x 2 ) quantifying the goodness-of-fit between data 
and theory. One uses MCMC or APS to find the values of 
8 corresponding to £ > £n m , where £ii m is some threshold 
set by statistics and the desired (100 — a)%. 

Bayesian inference approaches this problem indirectly 
by assuming that 9 are random variables distributed ac- 
cording to the distribution function £. Constraining these 
parameters to (100 — a)% is therefore a problem of integrat- 
ing £ over 9 until the integral contains (100— a)% of the total 
£. MCMC assumes this approach, as will be shown below. A 
common criticism of this mode of thought is that, even if the 
theory is a poor description of the data, Bayesian inference 



will still yield a constraint consisting of the (100 — a) % least- 
poor region of parameter space. Frequentist inference takes 
a more objective approach, defining the (100 — a)% confi- 
dence limit to be all points in parameter space that satisfy 
£ > £n m . APS assumes this approach, as will also be shown 
below, and exploits it to derive parameter constraints with 
siginficantly fewer calculations of £ (a presumably costly 
procedure) than MCMC. 

2.1 Sketch of the MCMC algorithm 

The direct product of MCMC is a chain: a list of points in 
parameter space that MCMC sampled and the number of 
times it sampled them. 

• (1M) Begin by sampling a point 9i from parameter 
space at random. Record this point in the chain and 
evaluate the likelihood function for that combination of 
parameters. The value of this likelihood call will be Ci. 

• (2M) Select another point 92 in parameter space. This 
point should be selected randomly, but according to some 
distribution that depends only on the step length \9i — 82] 
and is tuned so that the algorithm explores away from the 
initial point but does not effectively jump to any other 
place in parameter space uniformly at random. Evaluate 
the likelihood function at 82, giving you £2. 

• (3M) If £2 > £1 (i.e., if the new point in parameter 
space is a better fit to the data than the old point), the 
step is accepted and the new point is recorded in the chain. 
Set £1 = £2 and 81 — 82- 

• (4M) If £2 < £1 , draw a random number /3 between 
and 1. If P < C2/C1, accept the step anyway. 

• (5M) If the step was ultimately rejected, increment the 
number of samples associated with 81 by one. 

• (6M) Return to step (2M) and repeat. There are heuris- 
tic statistics that can be performed on the chain to give an 
indication of whether it has adequately sampled parameter 
space (Brooks and Gelman 1998)) . though they have issues 
we point out later. 

This is the most basic form an MCMC algorithm can 
take. In addition to Chapter 1 of (|Gilks et al. 1996)) . one 
may wish to consult appendix A of ([Lewis and Bridle ~2002[) 
for guidance writing a functional MCMC. 

The tests in step (3M) and (4M) mean that any MCMC 
run will gravitate towards the high-likelihood regions of pa- 
rameter space and ignore regions of extremely low likeli- 
hood. One converts the chain into parameter inferences by 
gridding parameter space and treating the chain as a his- 
togram of how many times the chain visited each point 
on the grid (or its proximity). This histogram will closely 
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match the Bayesian posterior distribution over the parame- 
ters. (100 — a)% confidence limits are derived by integrating 
this histogram from its peak out to whatever bounds contain 
(100 — a)% of the sampled points. 

There are several shortcomings with this method that 
the APS addresses. The first shortcoming is efficiency. Be- 
cause MCMC seeks to take samples that match the distri- 
bution rather than learning specific information about the 
distribution, such as where the confidence boundaries are, 
it wastes samples in areas where it already has informa- 
tion. For example, once a high-likelihood parameter setting 
has been evaluated there is no need to evaluate there again. 
It is already known that this is a high-likelihood region of 
the space. However, MCMC specifically indicates that high- 
likelihood by putting more samples there. These are wasted 
evaluations. 

A second MCMC shortcoming, related to the first, in- 
volves the possibility that there are several disjoint regions of 
high likelihood in the parameter space under consideration. 
This can be problematic for MCMC because steps (2M)- 
(4M) ensure that, once the chain finds one of these high 
likelihood regions, it is unlikely to step out of it and find 
another (unless the distribution in step (2M) is such that it 
allows very large steps, in which case the chain will take sig- 
nificantly longer to converge). An unexplored region of the 
space may remain unexplored for a long time (though in the- 
ory not infinitely long) . MCMC never determines whether it 
has explored the region and whether information could be 
learned by doing so. 

When initially testing APS , Bryan et al. found a sep- 
arate region of high-likelihood parameter space in the 1- 
year WMAP data release which had been totally ignored by 
MCMC analyses. This second peak in £ disappeared as the 
WMAP signal-to-noise improved in subsequent data releases 
(as we will see in Section [5]) , but it does serve as an illustra- 
tion of this particular peril of MCMC. Dunkley et al. (2005) 
and (Atchade and Rosenthal 2005) attempt to address this 
problem and the general inefficiency of MCMC as a search 
algorithm by optimizing the proposal density in step (2M). 
They find no clear solution that guards against multiple high 
likelihood regions. 

The final shortcoming of MCMC involves its unequivo- 
cally Bayesian nature. For a number of data points Nd 3> 
1 Bayesian and frequentist confidence limits are approxi- 
mately equivalent. They can, however, yield conflicting re- 
sults when constraining a subset of parameter space. Con- 
sider a six-dimensional parameter space like the space of cos- 
mological parameters mentioned in Section [1] and revisited 
in Section [5] If one is really only interested in constraining 
{61,62}, one still has to deal with the question of how to 
treat {#3, . . . ,6%}, since they presumably affect £ in a way 
that is non-trivially correlated with {61,62}- In Bayesian in- 
ference, one deals with this question by integrating the prob- 
ability distribution over the full range of the uninteresting 
parameters (marginalizing over them), i.e. the (100 — a)% 



confidence limit is the contour in {61,62} space such that 

/oc 
cfec(6) = 
- 00 

f6\f f@2f roG POO POO POO s \ 

/ dOi / dd 2 / dd 3 / ddi / dd 5 / d6 6 C (6) 

where {6u, #1/ }, {#2i, #2/ } represent the bounds of the confi- 
dence limit in the two-dimensional subspace of interest. This 
integration yields the two-dimensional ellipses of figures 4 
and 6 of (Lewis and Bridle 2002) (and most observational 
cosmology papers since). From a frequentist perspective, the 
integration in equation JTJ) is problematic. It risks drawing a 
bound that is either too restrictive or too inclusive. An over- 
restrictive bound could arise because the integral {T} weights 
points in {61,62} space according to the density of corre- 
sponding points in {61, . . . ,6s} space that are highly likely. 
If there are combinations of {61,62} that are highly likely 
only for an extremely limited set of {#3, . . . , 8§}, those points 
will be excluded from the (100 — a)% confidence limit even 
though they are strictly allowed according to the £ > £n m 
criterion. Bayesian inference does not give much thought to 
these excluded points because, if we consider the parameters 
6 to be random (which Bayesians do), the excluded points 
correspond to highly improbable values of {61, 62} precisely 
because they only agree with the data for such a limited 
set {63, . . . , 8q}. Frequentist bounds, consisting of any and 
all points for which £ > £n m , do not care how many dif- 
ferent points in the larger parameter space are acceptable 
for a given {61,62}- If just one point satisfies £ > £n m , the 
corresponding point in subspace is within the bound. 

An over-inclusive bound could arise because the integral 
|T} searches only for that bound which contains (100 — a)% 
of the total probability. If the model chosen to describe 
the data is universally poor, MCMC will still return a 
(100 — ct)% consisting of the least-poor region in parame- 
ter space without any obvious warning that the model is 
probably wrong. A frequentist bound would, in that case, 
return no (100 — a)% confidence bound, highlighting the 
model's deficiency. 

Traditionally, Bayesian assumptions are made because 
they do not require prior knowledge of the underlying like- 
lihood function. In order to draw a Frequentist confidence 
limit, one must be able to calculate £ii m before sampling any 
points in parameter space. Bayesian confidence limits only 
require knowledge of the relative likelihood between sampled 
points and thus can be revised with each new sample. This 
represents one advantage MCMC retains over APS. 

2.2 Active Parameter Search 

To address the shortcomings of MCMC, Bryan et al. Bryan 
et al. (2007b) propose the following alternative algorithm for 
exploring parameter space. 

• (1A) Generate some number Ns of starting points 
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uniformly distributed across parameter space. Evaluate £ 
at each of these points. Store them in {9i, Li}. 

• (2A) Use the points already evaluated to fit a surrogate 
function that estimates the likelihood and uncertainty in it 
for other, as yet unevaluated, candidate points. 

• (3A) Generate a large number Nc of candidate points 
also uniformly distributed over parameter space. Use the 
surrogate function to guess the value of £ at each of the 
candidate points. This guess will be fx. The uncertainty in 
your guess will be a. Section \2. 31 will describe one means of 
finding fi and a. The assumption is that estimating /j, and a 
is orders of magnitude less computationally expensive than 
finding the true £. 

• (4A) Select the candidate point with the maximum 
value of the statistic 

S = ccr - |Aim - Ml (2) 

where c is a parameter that has a similar effect as the 
length parameter in MCMC's proposal distribution. Small 
values will make it take more samples around already good 
(high £) points while large values will make it explore more 
agressively. Evaluate £ at the selected point and add it to 
the list of evaluated {#i,£i}. 

• (5 A) Repeat steps (2 A) through (4A). Confidence 
bounds are found by gridding the parameter space (though 
no integration or other accumulation is required). Conver- 
gence can be estimated heuristically by observing the size of 
changes in confidence bounds. 

Maximizing S in equation @ implies, to some extent, 
minimizing — £ii m |. This algorithm therefore seeks out the 
boundary of the (100 — a)% confidence limit, rather than 
its interior. This yields some efficiency improvements over 
MCMC. 

The dependence of S on a - the uncertainty of the pre- 
dicted candidate £ value - forces the algorithm to simul- 
taneously pay attention to unexplored regions of parame- 
ter space. We will see in Section 12.31 that, if a region of 
parameter space is unexplored, the value of a for a candi- 
date point chosen from that region will be very high. This 
algorithm therefore guards against the second shortcom- 
ing of MCMC (ignorance of disjoint regions of high likeli- 
hood) by explicitly stepping away from regions that are al- 
ready known to be near the (100 — a)% confidence limit and 
sampling points from poorly-sampled regions of parameter 
space. ( |Bryan 2007a[ | empirically compared this dependence 
on a to other information-theoretic dependences and found 
it to perform best. Following that reference, we set c = 1.96 
in equation ©. 

Step (4A) chooses points solely based on their own 
merit, rather than the ratio of likelihoods used in steps 
(3M) and (4M) of MCMC. This algorithm can therefore ap- 



ply a purely frequentist test to parameter space. Bounds 
are drawn in parameter space by examining the full set of 
points sampled and making a scatter-plot of those points 
which meet the £ > Cy im criterion, rather than by integrat- 
ing over the relative frequency with which the algorithm vis- 
ited different points in parameter space. This guards agains 
the final shortcoming of MCMC: the inherent subjectivity 
of Bayesian (100 — a)% confidence limits. 

2.3 Gaussian Processes 

Step (3A) of APS generates a random set of candidate points 
and uses {8i,d} data from the points in parameter space 
already sampled to predict the value of £ at each candi- 
date point (this prediction is fj, in equation [2]) and assign an 
uncertainty a to that prediction. The algorithm is agnostic 
about how this prediction and uncertainty are derived. We 
follow ( |Bryan et al. 2007b [ l and use the formalism of Gaus- 
sian processes (more specifically: Kriging) to make the pre- 
dictions. There is a slight difference between our formalism 
and theirs. This will be explained below. The following dis- 
cussion comes mostly from Chapter 2 and Appendix A of 
(Rasmusscn and Williams 2006). 

Gaussian processes use the sampled data {0i, £;} to pre- 
dict £ at the candidate point 9 q by assuming that the func- 
tion £(#) represents a sample drawn from a random process 
distributed across parameter space. At each point in param- 
eter space, £ is assumed to be distributed according to a 
normal distribution with mean £ and variance dictated by 
the covariance function dj(8i,8j). Cu is the intrisic vari- 
ance in the value of £ at a single point dj encodes how 
variations in £ at one point in parameter space affect vari- 
ations in £ at other points in parameter space. Rasmussen 
and Williams (2006) treat the special case £ = and finds 
(see their equations 2.19, 2.25 and 2.26) 

j-lq = ^^CqiC'ij Cj 

i,i 

°q = Cqq — ^^CqiCij Cjq (3) 

i,i 

where the sums over i and j are sums over the sampled 
points in parameter space. d q relates the sampled point i 
to the candidate point q. We do not wish to assume that the 
mean value of £ is zero everywhere. Therefore, we modify 
the equation for fj, to give 

ii^C + Y^Cq.C^^-C) (4) 

where £ is the algebraic mean of the sampled d. Note 
the similarity to a multi-dimensional Taylor series ex- 
pansion with the covariance matrix playing the role of 
the derivatives. Equation © differs from equation (6) in 
( |Bryan et al. 2007bp because they used the semivariance 

7«=w[£$)-£$)] 
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Figure 1. A onc-dimcnsional example of prediction using Gaus- 
sian processes. The black curve is the function being considered. 
The crosses are the points at which it has been sampled. The 
green curve is the resulting prediction and the red curves repre- 
sent the 2- and 3-<r uncertainty bounds. For the purposes of this 
illustration, we assumed that the Kriging parameter was K = 10. 



in place of the covariance Cy. In practice, the two assump- 
tions result in equivalently valid /i and a. 



The form of the covariance function CV 



assumed. We choose to use 



K exp 



2 3 



must be 



(5) 



where Dij is the distance in parameter space between the 
points 9i and 9j . The exponential form of dj quantifies the 
assumption that distant points should not be very corre- 
lated. The normalization constant K (known as the "Krig- 
ing parameter" for the geophysicist who pioneered the over- 
all method) also must be assumed. This is somewhat prob- 
lematic because, examining equation Q, one sees that the 
factors of K and K~ l completely factor out of the predic- 
tion fi, so that the assumed value of K has no effect on the 
accuracy of the prediction. If the opposite had been true, 
one could heuristically set K to maximize the accuracy of 
/i. Given that the function we are trying to model (£ as a 
function of set data and a specified theory) is not a random 
process, we find no a priori way to set K and instead set it 
according to heuristics that we believe are consistent with 
the behaviors we desire from the APS algorithm. We discuss 
this in more detail in Section [3] 

Figure [1] applies the Gaussian process of equations ([3]) 
and ([4]) with assumption ([5]) to a toy one-dimensional func- 
tion. Inspection shows many desirable behaviors in /x and 
a. As 8 q approaches the sampled points 6i, /j, approaches 
d and a approaches zero. Closer to a sampled point, the 
Gaussian process knows more about the true behavior of 
the function. Far from the 8i, a is larger, and the S statistic 
in equation ([2]) will induce the APS algorithm to examine 
the true value of C. 



2.4 Comparison of Algorithms 

Figure [2] directly compares APS with MCMC by running 
both on a toy model in two-dimensional parameter space. 
The model likelihood function is a x 2 -distribution with two 
degrees of freedom. We are interested in the 95% confidence 
limit, which corresponds to a criterion of 

C > Ata 



X <6 



The model is constructed so that there are two regions of 
parameter space that meet this criterion. These are the two 
ellipses in Figure 2(a) Figures 2(b) and 2(c) zoom in on one 



of these regions. The open circles correspond to the points 
sampled by MCMC. Each color represents an independent 
chain (four in all). Each of these chains was allowed to run 
until it had sampled 100 points in parameter space. The 
blue crosses represent the points sampled by APS after it 
had sampled a total of 400 points in parameter space. 

Note that no single MCMC chain found both regions 
of high likelihood. A user would have to run multiple in- 
dependent chains to be at all certain that she had discov- 
ered all of the regions of interest. It is now standard to ad- 
dress this by running several MCMC chains and aggregating 
their outputs, but it is concerning to leave coverage of the 
search space to the luck of multiple random restarts. Con- 
versely, APS sampled points from distant regions of parame- 
ter space. This is how APS ultimately found both regions of 
high likelihood. Indeed, after only 85 calls to the likelihood 
functio,APS had sampled points from both regions of high 
likelihood. 



Consider Figures 2(c) and 2(b) It is obvious that, even 



near the most densely-sampled high-likelihood region, APS 
is principally interested in points on the boundary of the 
confidence region, while MCMC samples the interior. This 
is another way in which APS is a more efficient allocation 
of computing resources than MCMC. 

Finally we can observe the perils of common MCMC 
convergence heuristics. They generally compare statistics 
from all the chains in aggregate against individual chains. At 
convergence, these statistics should be similar. We observe 
that in the toy example, convergence has not happened be- 
cause some chains are near one local maximum while some 
are at the other. However, from a practial point of view all 
the high likelihood areas have been identified. Conversely, if 
the four chains had been chosen unluckily, they would have 
all converged to the same local maximum and never reached 
the other. The convergence tests would have reported suc- 
cess while the reality would have been failure to find both 
local maxima. We will present a more detailed comparison 
of MCMC and APS in Section H 



3 TUNABLE PARAMETERS 

APS as presented thus far contains parameters which must 
be set by the user. We describe them in this section. A sum- 
mary list is presented at the end of the section. 
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(c) 

Figure 2. Points sampled by MCMC and APS on a toy likelihood function with two disjoint regions of high likelihood (the large 
ellipses) in a two-dimensional parameter space. Open circles are points sampled by MCMC. Each different color is an independent chain. 
Each chain is run until it has sampled 100 points. The blue crosses are the points sampled by a single run of the APS algorithm run 
until it has sampled 400 points. Figurcs |2(b)| and |2(c)| zoom in on one of the high-likelihood regions to demonstrate that points sampled 
with the APS algorithm do not tend to cluster in the center of the high likelihood region as points sampled by MCMC tend to do. Also 
note that no single MCMC chain finds both regions of high likelhiood. APS does find both. 



Nc is the number of candidate points considered at each 
iteration. One consideration that should go into choosing Nc 
is speed. If Nc is too large, the evaluation of all Nc values of 
fi and a in step (3A) will become comparably expensive to 
evaluating the likelihood function and the algorithm will lose 
some of its efficiency advantage over MCMC. Nc can also 
affect the algorithm's propensity to explore unknown regions 
of parameter space. A small Nc adds additional randomness 
because the selection will be affected more by the luck of 
choosing candidates than the metric used to evaluate them. 

Ns is the number of purely random points on which 
to sample C before proceeding to iterate steps (2A)-(4A). 
This number need only be sufficiently large to make the 
initial Gaussian process reasonable. It should not be so large 
as to be equivalent to a fine-gridding of parameter space, 



which would defeat the purpose of having an efficient search 
algorithm. 

In practice, one must specify bounds on each of the pa- 
rameters beyond which it is not worth exploring. To prevent 
the relative sizes of each parameter's allowed range from af- 
fecting the performance of our Gaussian process, we amend 
the covariance function © to read 



-r 



(6) 



where i and j denote different points in parameter space 
and the sum over a is a sum over the dimensionality of 
the parameter space. Additionally, in order to prevent the 
Gaussian process from becoming prohibitively slow as more 
points are added to the set of sampled points, we follow 
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( |Bryan et~ i l. 2007b) and only use the Ng nearest sampled 
neighbors of each candidate point when predicting fi and a. 
Ng is another choice made by the user. 

To set the Kriging parameter K, we sample an addi- 
tional uniform set of Nk points after the initial Ns sample 
but before proceeding to step (2A). For each of these points 
we both predict /i using the Gaussian process and sample 
the actual £. We set K equal to the value necessary that 
68% of these Nk points have £ — a < fi < £ + a. As the 
algorithm runs, we periodically adjust K so that the search 
through parameter space strikes a balance between exploring 
unknown regions of parameter space and identifying regions 
of parameter space satisfying £ ~ £ii m - Note that one can 
just as easily assume the value of K, this, however, runs 
the risk that the APS algorithm will either fail to explore 
outside of the discovered high-likelihood points (if K is too 
small and cr < |/i - £n m |) or will ignore the high-likelihood 
region altogether (the opposite case). 

We make one final modification to the APS algorithm as 
currently presented. Because of the absolute nature of Fre- 
quentist parameter constraints, a point in parameter space 
with likelihood £ — 0.95 x £n m is still not within the confi- 
dence limit. If the code finds such a point and immediately 
resumes its random search through the broader parameter 
space, it has not learned anything about the allowed val- 
ues of 0. We therefore modify the code so that, whenever it 



finds a point with Xu. 



< x 2 < 1.1 x xL 



it pauses in its 



search and spends some number Nl of evaluations trying 
to find a nearby point that is actually inside the confidence 
limit. To conduct this refined search, we borrow an idea 
from (Vandcrplas and Connolly 2012) and note that equa- 
tions Q and © offer a straightforwardly differentiable func- 
tion for fi in terms of our theoretical parameters. Much as 
equation ([4]) does a good job of characterizing the value 
of £ at an unknown point, the derivative of equation Q 
should do a good job of characterizing the derivative of £ 
at a known point. This allows us to use gradient descent to 
walk from a point near the likelihood threshold towards a 
point that is inside the likelihood threshold. The method is 
as follows. 

• (lg) Starting from the already sampled point 6 q 
which is near the likelihood threshold, assemble a list of 
the Ng nearest neighbors from the set of sampled points, 
this time including 9 q itself as the absolute nearest neighbor. 

• (2g) Differentiate equation ((4]) to get 
and differentiate equation © to get 



dC q ,t 

de a e 



X C a 



• (3g) Use this derivative to select a point in parameter 
space a small step away from 6 a along the direction that will 
maximize the change in £. Sample the likelihood function 
at this point. This point is now 6 q . 

• (4g) Repeat steps (lg) through (3g) until you find some 
fiducial maximum likelihood £ max or you have iterated Nl 
times, whichever comes first. 

This modification to APS is referred to as the "lingering 
modification." 

To summarize, the APS parameters that must be tuned 
by the user are (values in parentheses are the values used on 
the 6 dimensional parameter space in Section [5| 

• Ns (1000) - the number of random distributed starting 
points evaluated in step (1A) of APS . 

• Nk (1000) - a number of randomly sampled points used 
to heuristically set the Kriging parameter K in equation ©. 

• Nc (250) - the number of candidate points randomly 
generated in step (3A) of APS . 

• Ng (15) - the number of nearest neighbors used in the 
Gaussian Process. 

• Nl (100) - the maximum number of iterations to be 
spent using gradient descent in the lingering modification. 

• Xumi Ctextlim (x 2 — 1281) - the confidence limit 
threshold value APS is trying to find. 

• Xmini £max (x 2 = H97) - the value used as a target by 
gradient descent in the lingering modification. 

• The dimensionality of the parameter space and the 
maximum and minimum values each parameter is allowed 
to take. 



4 A TOY COMPARISON 

In this section, we will test APS 's performance against that 
of MCMC on a toy likelihood function with two regions of 
high likelihood. The function exists in four dimensions. The 
high likelihood regions are centered on the points 

0i = {-5.0, 0, 0, 0} 02 = {5.0,0,0,0} 

The function itself is characterized by a x 2 statistic with 



1199 degrees of freedom, which depends on according to 



dl , dl 



x \e) = i3oo.o + ^ + ^ 



max, a "min,i 



+ 1147exp[-d?] + 1200exp[-0.5d2] 

where di^ are the distances in parameter space from 01,2. 
In this case, the 95% confidence limit threshold corresponds 
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Figure 3. A one dimensional slice of our toy likelihood func- 
tion. The red dashed line corresponds to the 95% confidence limit 
threshold for a x 2 distribution with 1199 degrees of freedom. 



to a value of x 2 = 1281. Fi gure [3] shows a one-dimensional 
slice of this function. The red dashed line corresponds to the 
threshold x 2 limit. When testing APS , we set the tunable 
parameters thus 



N s 


= 20 


N K 


= 20 


N G 


= 15 


N C 


= 250 


N L 


= 50 


2 

Xmin 


= 1197 



Before running either MCMC or APS , we generate 
three test grids each of 10,000 known points on this like- 
lihood surface. We will use these grids to measure how long 
it takes MCMC and APS to learn the entire behavior of this 
likelihood surface. The first test grid is uniformly distributed 
across the entire likelihood surface (all 6 a parameters are 
allowed to vary from —10 to 10). The second grid is spheri- 
cally distributed about the rightmost high likelihood region 
in Figure [3] The third grid is spherically distributed about 
the leftmost high likelihood region. The first grid contains 
no points whose x 2 values are below the threshold limit. The 
second and third grids are roughly half comprised of points 
that are below the threshold. 

To compare the efficacy of APS and MCMC at char- 
acterizing the likelihood surface, we run each algorithm 200 
times. In the case of MCMC an individual "run" consists of 
four independent chains run in parallel. During each run, we 
periodically stop and consider the points sampled by each 
algorithm thus far. We treat these points as the input to a 
Gaussian process which we use to guess the x 2 values of the 
points on our three test grids. We quantify the efficacy of the 



algorithms in terms of the number of mischaracterizations 
made on each grid. We define a "mischaracterization" as a 
point on the test grid which satisfies x 2 < Xum bnt for which 
the Guassian process predicts \ 2 > Xum or vice-versa. Fig- 
ure^ shows the performance of the algorithms on this test 
averaged over all the 200 instantiations of each algorithm. 
You will note that while both algorithms were essentially 
perfect at characterizing the widely distributed test grid 1 
(on which no points satistfied x 2 < Xumi note the logarith- 



mic vertical axis in Figure 4(a) I, only APS with lingering 



successfully and reliably characterized both test grids cen- 
tered on the high-likelihood regions in Figure This is an 
illustration of the second shortcoming of MCMC identified 
in Section 12.11 once an MCMC chain has identified a high 
likelihood region, it is unlikely to step out of that region and 
consider the possibility that other high likelihood regions ex- 
ist. This problem persists even though each MCMC instanti- 
ation consists of four independent chains, each with its own 
opportunity to fall into one or the other high likelihood re- 
gion. Either the chains all fell into one high likelihood region 
and not the other, or the chains became trapped at the local 
minimum in equation (TJl at 6 — {0, 0, 0, 0}. In Section[5]we 
perform a similar test using actual data from the WMAP 
CMB experiment (|Jarosik et al. 20111 IWMAP 2010)) and 
the 6-dimensional parameter space of the flat concordance 
cosmology. 



5 TEST ON WMAP 7 YEAR DATA 

In this section, we test the APS algorithm on the 7-year 
data release of the WMAP satellite, which measured the 
temperature and polarization anisotropy spectrum in the 
cosmic microwave background fjarosik et al. 2011[) . For sim- 
plicity, we only consider anisotropies in the temperature- 
temperature correlation function and modify the likelihood 
function fWMAP 2 010[) to work in the space of anisotropy 
CVs, rather than working directly in the pixel space. This re- 
sults in a likelihood function sampled from a x 2 -distribution 
with 1199 degrees of freedom. In this case, the C > Aim cri- 
terion for the 95% confidence limit corresponds to x 2 < Xum 
with x 2 m = 1281. 

We take as our parameter space the six dimen- 
sional parameter space describing the set of spatially 
flat ACDM (cosmological constant and cold dark mat- 
ter) concordance cosmologies. Those parameters are 
{Q^h 2 , QcdmIi 2 , h, t, n s , ln[10 10 A]}. These parameters will 
be familiar to users of the MCMC code CosmoMC 
(|Lewis and Brid le 2002 ) as the relative density of baryons, 
the relative denstiy of dark matter, the present day Hubble 
parameter, the optical depth to last scattering, the spec- 
tral index controlling the scale-dependence of primordial 
density fluctuations in the universe, and a normalization 
parameter controlling the amplitude of primordial density 
fluctuations in the universe. We use the Boltzmann code 
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Figure 4. The performance of MCMC and APS at characterizing the points on the test grids built around the likelihood function in 
Figure [3] Solid lines are the mean number of mischaractcrizations averaged over 200 instantiations of the algorithm. Dashed lines are 
the mean plus one standard deviation. Dotted lines are the mean plus two standard deviations. Black lines are MCMC. Red lines arc 
APS without the lingering modification. Blue lines are APS with the lingering modification. The APS lines do not appear in Figure [4 (a) | 
because neither version of APS resulted in any mischaracterizations. Recall that each test grid contained 10,000 points. 



CAMB to convert from parameter values to anisotropy spec- 
tra ( |Lewis, Challinor, and Lasenby 2000[ ). 
In this case, we test APS with 



N K 
N c 
N G 
N L 

Xlim 
2 

Xmin 



1000 

1000 

250 

15 

100 

1281 

1197 



Our comparison MCMC is run by the publically available 
software CosmoMC (I Lewis and B ridle 2002}. We compare 
the results of the two algorithms - both in terms of derived 
constraints on the cosmological parameters and in terms of 
exploration of the full parameter space - below. 



5.1 Parameter Constraints 

As discussed in Section [2j MCMC determines parameter 
constraints by integrating over the posterior probability dis- 
tribution on parameter space while the APS algorithm de- 
termines constraints by listing found points which satisfy 
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C > £iim and determining the region of parameter space 
spanned by those points. Figure [5] compares these two ap- 
proaches by tracking the development of the 95% confidence 
limit contour in one two-dimensional slice of our full six- 
dimensional parameter space as a function of the number of 
points sampled by each algorithm. In each frame, the solid 
contours represent contours drawn by considering all of the 
points found by each algorithm which satisfy the Frequentist 
C > Aim requirement. The blue contour represents points 
found by APS . The red contour represents points found 
by MCMC. The black contour represents points found by 
MCMC after sampling a total of 1.2 million points. Note: 
for this comparison, we consider all of the points visited by 
MCMC, not just those points accepted in steps (3M) and 
(4M) from Section [2.11 This is, in some sense, a more com- 
plete set of information about the likelihood surface than 
MCMC usually returns. The dashed black contour is the 
contour drawn using the usual Bayesian inference on the 
points acccepted by MCMC after the full 1.2 million points 
have been sampled. The green crosses are the pixels in this 
2-dimensional parameter space that Bayesian inference be- 
lieves are inside of the 95% confidence limit after the speci- 
fied number of points have been sampled. 

The salient features of this figure are as follows. The 
solid black contour is larger than the dashed black contour. 
This means that MCMC visited points that met the Fre- 
quentist threshold £ > £n m but not with enough frequency 
to satisfy the 95% Bayesian confidence limit. This is an ex- 
ample of MCMC being too restrictive as discussed in Section 
12.11 A similar conclusion can be drawn from the fact that 
the green crosses do not fill the red contour until 400,000 
points have been sampled. 

The green crosses congregate in the center of the con- 
tours because MCMC is principally interested in the deep 
interior of the high likelihood region. This is a manifestation 
of the inefficiency of MCMC discussed in Section [27T] 

The blue and red contours track quite well at all points 
in the algorithms' histories. This shows that APS is at least 
as good as MCMC at deriving parameter constraints when 
you treat the points visited by MCMC from a Frequen- 
tist perspective. Comparing the blue contour to the green 
crosses, in Figure 5(c) one sees that APS derives accurate 



400,000 points sampled 



parameter constraints faster than MCMC treated from a 



Bayesian perspective. Figure [S] recreates Figure 5(d) except 



that the green crosses represent the 50% confidence limit ac- 
cording to Bayesian MCMC after sampling 400,000 points. 
The fact that these pixels still occur outside of the solid 
black contour (Frequentist MCMC 95% confidence limit af- 
ter sampling 1.2 million points) indicates that the false pos- 



itives in Figure 5(d) represent a significant fraction of the 



total posterior probability integrated by Bayesian MCMC 
at this point in the algorithm. In contrast, the APS con- 
tour already covers 86% of the final area of the Frequentist 
MCMC contour. Because of how they are drawn (£ > £n m ), 
APS contours will not include false positives. 

Figure [7] plots the one-dimensional 95% confidence lim- 




Bayesian MCMC (50% CL) 
Frequentist MCMC 
APS 

Frequentist MCMC (final) 
Bayesian MCMC (final) 



0.5 



Figure 6. This figure recreates Figure |5(d)| except that now 
the green crosses are the 50% Bayesian MCMC confidence limit. 
The fact that these crosses still occur outside of the solid black 
contours indicates that false positives account for a large fraction 
of the total posterior probability integrated by MCMC, even after 
sampling 400,000 points in parameter space. 

its on our six cosmological parameters. Again, the solid red 
lines consider all of the points visited by MCMC (not just 
those accepted by the chain) and set the limit according to 
the Frequentist threshold. The blue lines represent the re- 
sults from APS. Note that APS only sampled 420,000 points 
before we stopped it. MCMC sampled 1.2 million points. 
The dashed black lines are the confidence limits inferred 
from Bayesian analysis on only those points accepted by the 
MCMC chains. As in Figure [5] we see that APS converges 
to the same answers as Freqentist MCMC in comparable 
time, and that both Frequentist analyses find allowed pa- 
rameter values that are missed by the Bayesian analysis. 
Cases where the dashed black lines stry outside of the solid 
red lines indicate Bayesian anaysis applying spurious weight 
to low-likelihood points. 

At this point, we have made the case that APS gives 
more accurate parameter constraints in less time than the 
usual, Bayesian MCMC analysis. However, even if one were 
simply to modify their MCMC analyses to adopt a Frequen- 
tist perspective (the red contours and lines in Figures [S] and 
O, we show in Section \5 . 2 1 below that APS exhibits superior 
performance characterizing the entire likelihood surface, not 
just the high likelihood subsurface. In the language of Sec- 
tion [4] APS gives constraints as accurate as MCMC with 
greater confidence that you have not ignored any additional 
regions of high likelihood. 

5.2 Exploration of Parameter Space 



Section \5 . 1 1 examined the performance of the APS algorithm 
within the high likelihood region of parameter space by com- 
paring the derived parameter constraints to those found by 
MCMC. This section will consider the performance of the 
APS algorithm in the low likelihood region of parameter 
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Figure 5. The 95% confidence limit contours in a two-dimensional slice of the full six-dimensional cosmological parameter space 
as determined by APS and MCMC at different stages in the algorithm's progression. The solid curves represent contours drawn by 
considering all of the C > Aim points found by each algorithm. The solid red curve is MCMC after sampling the specified number of 
points. The solid blue curve is APS after sampling the specified number of points. The solid black curve is MCMC after sampling 1.2 
million points. This will be taken to be the truest answer. The dashed black curve represents the contour drawn using the usual Bayesian 
inference on the full 1.2 million-point MCMC chain. The green crosses represent the pixels that Bayesian inference declares to be inside 
of the confidence interval after MCMC has sampled the number of points specified. 



space, asking the question "how certain can we be that the 
excluded regions of parameter space contain no points of 
interest?" Recall the toy model in Section 0] MCMC is no- 
toriously inefficient (or even ineffective) at exploring multi- 
modal likelihood functions. By selecting sample points based 
both on proximity to the confidence limit and on uncertainty 
in the Gaussian process prediction a, the APS algorithm 
ought to improve on that performance. 

To test this hypothesis, we perform the test illustrated 
in Figure [4] on the WMAP parameter space and likeli- 
hood surface. We generate two test grids each of 1,000,000 
points. One grid is distributed uniformly across parameter 
space. This grid contains no good points that satisfy the 
£ > Aim criterion. The other grid is spherically distributed 
about the vicinity of the high likelihood region and contains 
110,000 good points. We then take the points sampled by 



MCMC and APS (again, we take all of the points sampled 
by MCMC; not just those points accepted in the chains) 
at different times in their history and use those points as 
the input for a Gaussian process, which we use to predict 
the £ values of the points on our test grid. Figure [5] shows 
the number of points that are thus misclassified (£ > £ii m 
points for which the Gaussian process predicts £ < £n m and 
vice- versa) as a function of the number of points fed into the 
Gaussian process. Because the WMAP 7 likelihood function 
is so expensive, we only ran this test once. To find the confi- 
dence limit (dashed) curves, we consider the uncertainty im- 
plied by the Gaussian process a in equation Q. The dashed 
curves encompass points that are within a of being misclas- 
sified (i.e. points for which x 2 < Xnm but fi + a > Xiim an d 
vice versa). 



As you can see from Figure 



APS does a much 
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Figure 7. These figures plot the 95 % confidence limit bounds on our cosmological parameter set as derived by MCMC and APS as a 
function of the number of points sampled. The red solid line treats the output from MCMC using a Frequentist perspective. The blue 
solid line treats the output from APS (also as a Frequentist). The dashed black line treats the output from MCMC using the usual 
Baycsian perspective. 
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better job at characterizing the uniform test grid than does 



MCMC. Figure 8(b) shows that the two algorithms perform 
comparably poor on the high likelihood test grid. This is 
likely due to the compact nature of the high likelihood re- 
gion of the WMAP 7 likelihood surface. It is small both in 
terms of extent on parameter space and in terms of the dif- 
ference in x 2 between likely and unlikely points. Recall that 
the 95% confidence limit we are considering corresponds to 
X 2 = 1281. The smallest x 2 found by either algorithm was 
X 2 ~ 1270. This small difference between likely and unlikely 
points means that even a fraction of a percent error in the 
value of x 2 predicted by our test grid Gaussian process will 
result in a mischaracterization. Figure [9] shows that this is 
indeed what happens. Here we plot the fractional error in 
predicted x 2 as a function of actual x 2 f° r both algorithms. 
Red curves are MCMC. Blue curves are APS . Dotted curves 
are results after sampling 50,000 points. Dashed curves are 
results after sampling 100,000 points. Solid curves are results 
after sampling 400,000 points. As you can see from Figure 



9(a) there is indeed imprecision on the order of 1% when 
considering test points near the 95% confidence limit thresh- 
old. On a more forgiving likelihood surface with greater A^ 2 
between likely and unlikely points, this would not result in 
the large number of mischaracterizations evident in Figure 



8(b) The WMAP 7 likelihood surface is anything but for- 
giving. 

Readers interested in seeing how APS learns about the 
likelihood surface over time can consider Figures [9(b)] which 
recreates Figure 9(a) for a broader expanse of x j an d Fig- 



ure [TD] which shows the fraction of mischaracterized points 
on both test grids as a function of x 2 values. As you can 
see, while APS learns rapidly about the unlikely regions of 
parameter space, MCMC remains largely oblivious to what 
is going on in the regions outside of its integral bounds. 

These results, combined with the parameter constraints 
illustrated in Section 15.11 cause us to conclude that APS 
is at least as effective as MCMC and locating regions of 
high likelihood parameter space, and is significantly more 
robust against anomalies in regions of low likelihood param- 
eter space. 



space. MCMC is robust only in the high likelihood region 
it happens to discover. 

• (iiia) APS allows investigators to apply frequentist as- 
sumptions to their parameter constraints. 

The shortcomings of APS are twofold. 

• (ib) APS has no framework for exploring a likelihood 
function whose form is not known (this is the corrollary to 
(iiia) above). You must specify Cum or Xiim before running 
APS. You cannot use APS to discover £ii m . 

• (iib) There is no well-accepted stopping criterion for the 
algorithm equivalent to the convergence criteria usually ap- 
plied to MCMC (for example, (|Brooks and Gelman 19980 ). 
However, as we observed on the toy problem and as Bryan 
et al. observed by finding a second high likelihood area in 
the WMAP data, MCMC's convergence provides a false 
sense of security. 

• (iiib) APS is much more complicated to implement than 
the most basic MCMC. We hope that by making our code 
publically available, we can help the community to overcome 
this hurdle. 

APS may be used as a more efficient alternative to 
MCMC. They may also be combined. Often, the conver- 
gence of MCMC chains is dependent on the size of pa- 
rameter space to be explored. The larger the region, the 
slower convergence. Investigators can exploit advantage (ia) 
by pre-processing their data with APS and using the discov- 
ered high likelihood regions to set the prior bounds for their 
MCMC analyses. 

We make our code available at 
https://github.com/uwssg/APS. The code is presented 
as a series of C++ modules with directions indicating 
where the user can interface with her particular likelihood 
function. Those with questions about the code or the 
algorithm should not hesitate to contact the authors. 



6 DISCUSSION 

APS as demonstrated here has three principal advantages 
over MCMC when deriving parameter constraints. 

• (ia) APS is more computationally efficient than MCMC 
in that it does not spend time exploring the interior of high 
likelihood regions when only their bounds are of interest. 
As a result it can yield comparable parameter constraints 
with significantly fewer calls made to expensive likelihood 
functions. 

• (iia) APS allows for simultaneous robust statements 
about both high and low likelihood regions of parameter 
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Figure 8. The same test we performed on our toy model in Figure [4] this time using the WMAP7 parameters and likelihood function. 
The "uniform test grid" is distributed uniformly across parameter space and contains no good points that satisfy C > £iim- The "high 
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Figure 10. These plots combine both of the test grids from Figurc[8]and show the fraction of points misclassificd by the test Gaussian 
process as a function of their true x 2 value. Red curves consider points sampled by MCMC. Blue curves consider points sampled by APS 
. Dashed curves are as in Figure |H] 
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